1 Abstract

This report presents an initial analysis of Project STAR data, exploring the relationship between class type and student math performance in 1st grade. The study also examines how teacher-student ethnicity overlap affects student achievement. Descriptive statistics, visualizations, and summary measures are provided to identify trends in student performance across different class types and teacher attributes. Future analysis will involve inferential methods to determine significant associations and causal relationships.

2 Introduction

2.1 Questions of interest

The following questions are taken from the official course project description. More specific follow-up questions will be added in the final report.

Primary question: Are there any differences in math scaled scores in 1st grade across class types?

Secondary question: If there are differences, which class type is associated with the highest math scaled scores in 1st grade?

3 Background

4 Descriptive analysis

4.1 Data Import and Processing

library(dplyr)
library(ggplot2)
library(ggthemes)
library(knitr)
library(kableExtra)
library(patchwork)
library(car)
library(MASS)
library(broom)
library(AER)
data("STAR")

For the initial report, we pull in the Project STAR data from the AER package. The dataset contains 11,598 student records with 47 columns.

4.2 Initial Variable Selection

data <- STAR %>% dplyr::select(c(ethnicity), 
                               contains("1"), 
                               -c(read1))

# Move math1 variable to the first column
data <- data %>% dplyr::select(math1, everything())

Since we are only interested in the math scores of students in 1st grade, we select all columns that contain “1” in the variable name, which indicates the first grade. We also include the ethnicity of students for further analysis. The read1 variable is excluded as it is not relevant to the primary question of interest. At this stage, we are left with 11 columns.

The variables we are interested in are as follows:

  • math1: Math scaled score in 1st grade (outcome variable)
  • star1: Class type in 1st grade
  • schoolid1: School ID in 1st grade
  • ethnicity & tethnicity1: Student and teacher ethnicities (utilized to create ethnicity_overlap value for each class)
  • lunch1: Indicator of whether student received free lunch in 1st grade (translates to economic status of student)
  • degree1: Teacher’s highest education level
  • experience1: Teacher’s teaching experience
  • ladder1: Teacher’s career ladder level

4.3 Addressing Missing Values

# Plot Percentage of missing values for each variable

# Create a data frame with missing value percentages
missing_values <- data.frame(
  variable = names(data),
  missing = colMeans(is.na(data)) * 100
)

# Convert variable to factor with reversed levels
missing_values$variable <- factor(missing_values$variable, levels = rev(missing_values$variable))

# Plot without ordering and display percentages on the bars
ggplot(missing_values, aes(x = variable, y = missing)) +
  geom_bar(stat = "identity", fill = "#4AA2D0") +
  geom_text(aes(label = sprintf("%.1f%%", missing)), hjust = 1, size = 4, color='#263238') +
  coord_flip() +
  labs(title = "Percentage of Missing Values by Variable",
       x = "Variable",
       y = "Percentage Missing") +
  theme_minimal() +
  theme(plot.title = element_text(size = 11))

We observe that the math1 variable has missing values (approximately 43.1%), as well as other variables of interest. For now, we only remove rows with missing values in the math1 variable. Other variables may still contain missing values, which will be addressed in the final report before being incorporated into a model/analysis.

# Remove rows with missing values in math1

data <- data %>% filter(!is.na(math1))

# Plot Percentage of missing values for each variable

# Create a data frame with missing value percentages
missing_values <- data.frame(
  variable = names(data),
  missing = colMeans(is.na(data)) * 100
)

# Convert variable to factor with reversed levels
missing_values$variable <- factor(missing_values$variable, levels = rev(missing_values$variable))

# Plot without ordering and display percentages on the bars
ggplot(missing_values, aes(x = variable, y = missing)) +
  geom_bar(stat = "identity", fill = "#4AA2D0") +
  geom_text(aes(label = sprintf("%.1f%%", missing)), hjust = -0.1, size = 4, color='#263238') +
  coord_flip() +
  labs(title = "Percentage of Missing Values by Variable (after removal of missing values in math1)",
       x = "Variable",
       y = "Percentage Missing") +
  theme_minimal() +
  theme(plot.title = element_text(size = 11))

After removing rows with missing values in the math1 variable, we are left with 6,600 student records.

4.4 Creating unique teacher IDs (will not be required with data from the Harvard Dataverse)

# Attach unique teacher_id variable to the data based on combinations of star1, experience1, tethnicity1, schoolid1
data <- data %>% mutate(teacher_id = paste(star1, experience1, tethnicity1, schoolid1, sep = "_"))

# Check the number of unique teacher IDs
unique_teachers <- unique(data$teacher_id)

To address the lack of a given teacher/class ID in the dataset from the AER package, we created a unique teacher_id variable based on the combination of star1, experience1, tethnicity1, and schoolid1. There are 366 unique teacher IDs (number of classes) in the dataset. This variable will be used to aggregate students’ performance in the following steps.
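One caveat with this composite key is that two genuinely different classes could collide if they happen to share all four attributes. A minimal base-R sketch of the same construction on toy stand-in columns (not the STAR data), plus a cheap sanity check on group sizes:

```r
# Toy stand-in for the four STAR columns used to build teacher_id
toy <- data.frame(
  star1       = c("small", "small", "regular", "regular"),
  experience1 = c(7, 7, 12, 12),
  tethnicity1 = c("cauc", "cauc", "afam", "afam"),
  schoolid1   = c(63, 63, 20, 20)
)

# Same construction as above: concatenate the attributes into one key
toy$teacher_id <- paste(toy$star1, toy$experience1,
                        toy$tethnicity1, toy$schoolid1, sep = "_")

# Two distinct classes expected here; in the real data, implausibly
# large groups (e.g. more than ~30 students) would hint at key collisions
n_ids <- length(unique(toy$teacher_id))
```

Inspecting `table(data$teacher_id)` against plausible 1st-grade class sizes is a quick way to spot such collisions before aggregating.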

4.5 Creating new variable of interest: Teacher-Student Ethnicity Overlap Percentage

# Compute number of students by ethnicity and their respective percentage
student_ethnicities <- data %>%
  group_by(ethnicity) %>%
  summarise(
    count = n(),  # Number of students per ethnicity
    percentage = (n() / nrow(data)) * 100  # Convert to percentage
  ) %>%
  arrange(desc(count))  # Sort by count in descending order

# Display as a formatted table
student_ethnicities %>%
  kable(caption = "Table 1: Number of Students by Ethnicity and Percentage") %>%
  kable_styling(full_width = FALSE)
Table 1: Number of Students by Ethnicity and Percentage

ethnicity   count   percentage
cauc         4402        66.70
afam         2153        32.62
asian          19         0.29
other          11         0.17
hispanic        9         0.14
amindian        4         0.06
NA              2         0.03
# Compute number of teachers by ethnicity and their respective percentage
teacher_ethnicities <- data %>%
  group_by(tethnicity1) %>%
  summarise(
    count = n(),  # Number of student records per teacher ethnicity
    percentage = (n() / nrow(data)) * 100  # Convert to percentage
  ) %>%
  arrange(desc(count))  # Sort by count in descending order

# Display as a formatted table
teacher_ethnicities %>%
  kable(caption = "Table 2: Number of Student Records by Teacher Ethnicity and Percentage") %>%
  kable_styling(full_width = FALSE)
Table 2: Number of Student Records by Teacher Ethnicity and Percentage

tethnicity1   count   percentage
cauc           5420        82.12
afam           1138        17.24
NA               42         0.64

We notice that there is an imbalance in the number/proportion of ethnicities in both the student and teacher populations. We also note that there are 6 unique ethnicity values for students (caucasian, african american, asian, hispanic, native american, and other) while there are only 2 unique ethnicity values for teachers (caucasian, african american). The lack of variety in teacher ethnicities could be attributed to the location and time period of the study, where the teacher population may have been predominantly caucasian and african american.

Rather than using the ethnicity or tethnicity1 variables as is (the ethnicities themselves are not of primary interest), we want to investigate whether a teacher and student sharing the same ethnicity affects overall student performance in a class. To answer this question, we create a new variable ethnicity_overlap that measures the percentage of students in a class who share the teacher's ethnicity. This variable is a first step toward quantifying any presence of biased or unbiased teaching practices based on shared characteristics between teachers and students.

# First create a binary variable column indicating whether the student and teacher ethnicity match
data <- data %>%
  mutate(
    ethnicity = as.character(ethnicity),
    tethnicity1 = as.character(tethnicity1),
    ethnicity_match = ifelse(ethnicity == tethnicity1, 1, 0)  # Create binary match variable
  )
         
teacher_ethnicity_overlap <- data %>%
  group_by(teacher_id) %>%
  summarise(ethnicity_overlap = mean(ethnicity_match, na.rm = TRUE) * 100)  # Convert to percentage

data <- data %>%
  left_join(teacher_ethnicity_overlap, by = "teacher_id")

head(data, 10) %>% 
  dplyr::select(teacher_id, ethnicity_overlap, tethnicity1) %>%
  kable(caption = "Table 3: Teacher-Student Ethnicity Overlap Percentage (first 10 rows)") %>%
  kable_styling(full_width = FALSE)
Table 3: Teacher-Student Ethnicity Overlap Percentage (first 10 rows)

teacher_id              ethnicity_overlap   tethnicity1
small_7_cauc_63                     86.67   cauc
small_32_afam_20                   100.00   afam
regular+aide_8_cauc_5               95.65   cauc
regular_7_cauc_50                   90.48   cauc
regular_11_cauc_69                 100.00   cauc
small_15_cauc_79                    95.00   cauc
regular_0_cauc_5                    90.00   cauc
regular_5_cauc_16                    0.00   cauc
regular_17_cauc_48                 100.00   cauc
regular_1_afam_51                   52.38   afam

Further visualizations and analysis on this newly created variable will be conducted in the following sections.

4.6 Student Performance by Teacher and Appropriate Summary Measure

# Aggregate students' performance by teacher

teacher_performance <- data %>% 
  group_by(teacher_id) %>% 
  summarise(mean_math1 = mean(math1, na.rm = TRUE),
            median_math1 = median(math1, na.rm = TRUE),
            sd_math1 = sd(math1, na.rm = TRUE),
            q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
            q3_math1 = quantile(math1, 0.75, na.rm = TRUE),
            n = n())

# Display the first few rows of the aggregated data
head(teacher_performance, 10) %>%
  kable(caption = "Table 4: Aggregated Student Performance by Teacher (first 10 rows)") %>%
  kable_styling(full_width = FALSE)
Table 4: Aggregated Student Performance by Teacher (first 10 rows)

teacher_id                mean_math1   median_math1   sd_math1   q1_math1   q3_math1    n
regular+aide_0_cauc_48        495.00          487.0      34.52     475.25     509.25   26
regular+aide_0_cauc_77        548.55          542.0      29.31     529.75     578.00   22
regular+aide_10_cauc_1        545.93          545.0      49.54     501.00     584.00   27
regular+aide_10_cauc_32       497.91          490.0      26.40     482.50     516.50   23
regular+aide_10_cauc_67       536.55          542.0      33.15     512.00     554.00   20
regular+aide_10_cauc_9        544.73          547.0      34.10     535.00     560.75   22
regular+aide_11_afam_49       517.65          518.0      32.56     495.00     535.00   26
regular+aide_11_cauc_13       551.35          545.5      39.88     528.25     578.00   20
regular+aide_11_cauc_36       533.14          524.5      35.25     507.00     555.00   22
regular+aide_11_cauc_37       535.95          519.0      58.58     491.75     572.00   20
# Distribution of the mean math scores by teacher
p1 <- ggplot(teacher_performance, aes(x = mean_math1)) +
  geom_histogram(fill = "#4AA2D0", color = "#263238", bins = 30) +
  labs(title = "Distribution of Mean Math Scores by Teacher/Class",
       x = "Mean Math Score",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(size = 10))

# Distribution of the median math scores by teacher
p2 <- ggplot(teacher_performance, aes(x = median_math1)) +
  geom_histogram(fill = "#2B5798", color = "#263238", bins = 30) +
  labs(title = "Distribution of Median Math Scores by Teacher/Class",
       x = "Median Math Score",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(size = 10))

# Display the two plots side by side
p1 + plot_spacer() + p2 + plot_layout(widths = c(1, 0.3, 1))

# Quantile-quantile plot of the mean math scores by teacher
qqPlot(teacher_performance$mean_math1, distribution = "norm", main = "Q-Q Plot of Mean Math Scores by Teacher")

## [1] 171  20

We observe that the distribution of mean student math scores by teacher is approximately normal, with no visually significant outliers (scores are bounded by fixed minimum and maximum values). Approximate normality is further supported by the quantile-quantile plot, where the points lie close to the reference line. The mean is a better summary measure than the median here: the distribution of class means is closer to normal, while the class medians show slightly more variability across bin levels.
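As a formal complement to these visual checks, a Shapiro-Wilk test can quantify departures from normality. The sketch below runs it on simulated class means (`sim_means` is a stand-in for `teacher_performance$mean_math1`, with assumed mean and spread):

```r
# Simulated stand-in for the 366 per-teacher mean math scores
set.seed(1)
sim_means <- rnorm(366, mean = 530, sd = 30)

# Shapiro-Wilk normality test (base R); a large p-value is consistent
# with approximate normality
res <- shapiro.test(sim_means)
res$statistic
res$p.value
```

On the real data, this test would support (or challenge) the choice of the mean as the summary measure.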

4.7 Univariate Descriptive Statistics of Selected Variables

This section provides visual and tabular summaries of the selected variables of interest.

4.7.1 STAR Class Type (star1)

class_types_count <- data %>%
  group_by(star1) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

ggplot(class_types_count, aes(x = reorder(star1, count), y = count)) +
  geom_bar(stat = "identity", fill = "#00D29A") +
  geom_text(aes(label = count), hjust = 1.5, size = 4, color='#263238') +
  coord_flip() +  # Rotate chart; axis labels follow their aesthetics
  labs(title = "Number of Students by STAR Class Type",
       x = "STAR Class Type",
       y = "Number of Students") +
  theme_minimal()

While not perfectly balanced, the distribution of students across class types is relatively even. The majority of students are in regular classes, followed by regular classes with an aide, and then small classes. Taking note of the overall and class-wise student size may be important for future analysis such as testing for homogeneity of variance in ANOVA.
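The homogeneity-of-variance check mentioned above can be sketched as follows, on simulated scores for the three class types (the means and SDs loosely mirror the class-type summaries later in this report; the real check would use math1 and star1):

```r
set.seed(42)
# Simulated scores for three class types with similar spreads
scores <- c(rnorm(300, 525, 42), rnorm(280, 539, 44), rnorm(290, 530, 43))
class_type <- factor(rep(c("regular", "small", "regular+aide"),
                         times = c(300, 280, 290)))

# Bartlett's test (base R); car::leveneTest() is a more robust alternative
bt <- bartlett.test(scores ~ class_type)
bt$p.value  # a large p-value gives no evidence against equal variances
```

Bartlett's test is sensitive to non-normality, which is why Levene's test from the already-loaded car package may be preferable on the real scores.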

4.7.2 School ID (schoolid1)

# Bar plot of teacher count per school
teacher_count_per_school <- data %>%
  group_by(schoolid1) %>%
  summarise(teacher_count = n_distinct(teacher_id))  # count unique teachers, not student rows

ggplot(teacher_count_per_school, aes(x = factor(schoolid1), y = teacher_count)) +
  geom_bar(stat = "identity", fill = "#00D29A", color = "#263238") +
  labs(title = "Number of Teachers per School",
       x = "School ID",
       y = "Teacher Count") +
  theme_minimal()

The distribution of teachers across schools is relatively balanced, with a few schools having more teachers than others. Overall, this means that we have a good spread of teachers across schools, which is important for generalizability of results.

4.7.3 Ethnicity Overlap (ethnicity_overlap)

# Compute summary statistics for ethnicity_overlap
ethnicity_overlap_summary <- data %>%
  summarise(
    min = min(ethnicity_overlap, na.rm = TRUE),
    q1 = quantile(ethnicity_overlap, 0.25, na.rm = TRUE),
    median = median(ethnicity_overlap, na.rm = TRUE),
    mean = mean(ethnicity_overlap, na.rm = TRUE),
    q3 = quantile(ethnicity_overlap, 0.75, na.rm = TRUE),
    max = max(ethnicity_overlap, na.rm = TRUE),
    sd = sd(ethnicity_overlap, na.rm = TRUE),
    missing = sum(is.na(ethnicity_overlap))
  )

# Display the summary statistics in a formatted table
ethnicity_overlap_summary %>%
  kable(caption = "Table 5: Summary Statistics for Ethnicity Overlap") %>%
  kable_styling(full_width = FALSE)
Table 5: Summary Statistics for Ethnicity Overlap

min      q1   median    mean   q3   max      sd   missing
  0   71.43    93.33   78.15  100   100   32.00        42
# Boxplot
p11 <- ggplot(data, aes(y = ethnicity_overlap)) +
  geom_boxplot(fill = "#00D29A", color = "#263238", outlier.color = "red", outlier.shape = 16) +
  labs(title = "Boxplot of Teacher-Student Ethnicity Overlap",
       y = "Ethnicity Overlap (%)") +
  theme_minimal() +
  theme(plot.title = element_text(size = 10))

# Histogram
p12 <- ggplot(data, aes(x = ethnicity_overlap)) +
  geom_histogram(fill = "#00D29A", color = "#263238", bins = 30) +
  labs(title = "Distribution of Teacher-Student Ethnicity Overlap", 
       x = "Ethnicity Overlap (%)", 
       y = "Frequency") + 
  theme_minimal() +
  theme(plot.title = element_text(size = 10))

p11 | p12
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_bin()`).

For the newly created variable ethnicity_overlap, we observe that the distribution is heavily skewed: most classes have full (100%) overlap between student and teacher ethnicities, while a noticeable number of classes have 0% overlap. This variable will require deeper investigation when analyzing its relationship with student performance, but may yield interesting insights into the impact of shared characteristics between teachers and students.
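Given the mass at 0% and 100%, one option for the later analysis is to bin the overlap into coarse categories rather than treating it as continuous. A minimal sketch on toy values (the cut points and labels are our own assumption, not part of the project description):

```r
# Toy overlap percentages mirroring the observed skew
overlap <- c(0, 0, 52.4, 86.7, 90, 95, 100, 100, 100)

# Bin into none / low / high / full overlap; the 99.999 break keeps
# values just under 100 out of the "full" bin
overlap_cat <- cut(overlap,
                   breaks = c(-Inf, 0, 50, 99.999, Inf),
                   labels = c("none", "low", "high", "full"))
table(overlap_cat)
```

Categorical bins like these would make the 0%-vs-100% contrast directly testable in an ANOVA-style model.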

4.7.4 Economic Status (lunch1)

# bar plot of lunch1
lunch_count <- data %>%
  group_by(lunch1) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

ggplot(lunch_count, aes(x = reorder(lunch1, count), y = count)) +
  geom_bar(stat = "identity", fill = "#00D29A") +
  geom_text(aes(label = count), hjust = 1.1, size = 4, color='#263238') +
  coord_flip() +  # Rotate chart; axis labels follow their aesthetics
  labs(title = "Number of Students by Economic Status",
       x = "Economic Status (lunch1)",
       y = "Number of Students") +
  theme_minimal()

While there are no direct indicators of students’ family economic status, we can utilize the lunch1 variable as a proxy. The distribution of students across economic status is relatively balanced, with a very slight majority of students receiving free lunch. This variable will be important to consider when analyzing the relationship between economic status and student performance.

4.7.5 Teacher’s Highest Education Level (degree1)

# Bar plot of degree1
degree_count <- data %>%
  group_by(degree1) %>%
  summarise(count = n_distinct(teacher_id)) %>%  # count unique teachers, not student rows
  arrange(desc(count))

ggplot(degree_count, aes(x = reorder(degree1, count), y = count)) +
  geom_bar(stat = "identity", fill = "#00D29A") +
  geom_text(aes(label = count), hjust = 0.2, size = 4, color='#263238') +
  coord_flip() +  # Rotate chart; axis labels follow their aesthetics
  labs(title = "Number of Teachers by Highest Education Level",
       x = "Highest Education Level",
       y = "Number of Teachers") +
  theme_minimal()

We notice that the majority of teachers have a bachelor’s degree, followed by a master’s degree. The lack of teachers in other education levels may limit the capacity to analyze the impact of higher education levels on student performance at a more granular level. However, this variable will still be able to provide insight into whether or not a teacher’s graduate degree makes a significant difference in student performance.

4.7.6 Teacher’s Teaching Experience (experience1)

# Table of summary statistics for experience1 including missing values
experience_summary <- data %>%
  summarise(
    min = min(experience1, na.rm = TRUE),
    q1 = quantile(experience1, 0.25, na.rm = TRUE),
    median = median(experience1, na.rm = TRUE),
    mean = mean(experience1, na.rm = TRUE),
    q3 = quantile(experience1, 0.75, na.rm = TRUE),
    max = max(experience1, na.rm = TRUE),
    sd = sd(experience1, na.rm = TRUE),
    missing = sum(is.na(experience1))
  )

kable(experience_summary, caption = "Table 6: Summary Statistics for Teaching Experience") %>%
  kable_styling(full_width = FALSE)
Table 6: Summary Statistics for Teaching Experience

min   q1   median    mean   q3   max     sd   missing
  0    4       10   11.63   17    42   8.92        12
# Boxplot of experience1

p9 <- ggplot(data, aes(y = experience1)) +
  geom_boxplot(fill = "#00D29A", color = "#263238", outlier.color = "red", outlier.shape = 16) +
  labs(title = "Boxplot of Teacher's Teaching Experience",
       y = "Teaching Experience") +
  theme_minimal() +
  theme(plot.title = element_text(size = 11))

# Histogram of experience1

p10 <- ggplot(data, aes(x = experience1)) +
  geom_histogram(fill = "#00D29A", color = "#263238", bins = 30) +
  labs(title = "Distribution of Teacher's Teaching Experience",
       x = "Teaching Experience",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(size = 11))

p9 | p10
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

We note the presence of some outlier teachers who have a significantly longer teaching experience than others. Overall, the distribution is slightly right-skewed, with a majority of teachers having between 4 to 17 years of experience.

4.7.7 Teacher’s Career Ladder Level (ladder1)

# Bar plot of ladder1

ladder_count <- data %>%
  group_by(ladder1) %>%
  summarise(count = n_distinct(teacher_id)) %>%  # count unique teachers, not student rows
  arrange(desc(count))

ggplot(ladder_count, aes(x = reorder(ladder1, count), y = count)) +
  geom_bar(stat = "identity", fill = "#00D29A") +
  geom_text(aes(label = count), hjust = 0.2, size = 4, color='#263238') +
  coord_flip() +  # Rotate chart; axis labels follow their aesthetics
  labs(title = "Number of Teachers by Career Ladder Level",
       x = "Career Ladder Level",
       y = "Number of Teachers") +
  theme_minimal()

The majority of teachers are in the level 1 career ladder. Given the lack of teachers in other career ladder levels, the teacher’s experience level may be a more appropriate variable to analyze the impact of teacher career progression on student performance.

4.8 Multivariate Descriptive Statistics of Selected Variables

4.8.1 Math Scores (math1) v.s. STAR Class Type (star1)

class_types <- data %>%
  group_by(star1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE),
            median_math1 = median(math1, na.rm = TRUE),
            sd_math1 = sd(math1, na.rm = TRUE),
            q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
            q3_math1 = quantile(math1, 0.75, na.rm = TRUE))
            
class_types %>%
  kable(caption = "Table 7: Summary Statistics of Math Scores by STAR Class Type") %>%
  kable_styling(full_width = FALSE)
Table 7: Summary Statistics of Math Scores by STAR Class Type

star1          mean_math1   median_math1   sd_math1   q1_math1   q3_math1
regular            525.27            523      41.65     495.00        553
small              538.68            535      44.10     509.25        567
regular+aide       529.63            529      42.87     497.00        557
ggplot(data, aes(x = star1, y = math1)) +
  geom_boxplot(fill = "#AFAAFF", color = "#263238") +
  labs(title = "Math Scores by STAR Class Type",
       x = "STAR Class Type",
       y = "Math Score") +
  theme_minimal()

The side-by-side boxplot of math scores by class type shows that students have similar distributions of math scores across class types. While small classes show a slightly higher mean, median, and interquartile range, we cannot confirm any significant differences in student performance across class types at this stage. This motivates an ANOVA test, followed by Tukey's HSD test, to determine whether any class types differ significantly.
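The ANOVA and Tukey HSD workflow this motivates can be sketched on simulated data (the group means and SDs loosely mirror Table 7; the real fit would replace `sim` with the processed STAR data frame):

```r
set.seed(7)
# Simulated scores for three class types
sim <- data.frame(
  star1 = factor(rep(c("regular", "small", "regular+aide"), each = 100)),
  math1 = c(rnorm(100, 525, 42), rnorm(100, 539, 44), rnorm(100, 530, 43))
)

fit  <- aov(math1 ~ star1, data = sim)  # overall F-test for class-type differences
thsd <- TukeyHSD(fit)                   # pairwise comparisons, family-wise error controlled
thsd$star1                              # one row per pair of class types
```

Tukey's HSD is only examined if the overall F-test rejects the null of equal class-type means.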

4.8.2 Math Scores (math1) v.s. School ID (schoolid1)

# Summary statistics of math scores by schoolid1
schoolid_count <- data %>%
  group_by(schoolid1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE),
            median_math1 = median(math1, na.rm = TRUE),
            sd_math1 = sd(math1, na.rm = TRUE),
            q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
            q3_math1 = quantile(math1, 0.75, na.rm = TRUE))

head(schoolid_count, 10) %>%
  kable(caption = "Table 8: Summary Statistics of Math Scores by School ID (first 10 rows)") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Table 8: Summary Statistics of Math Scores by School ID (first 10 rows)

schoolid1   mean_math1   median_math1   sd_math1   q1_math1   q3_math1
1               536.90            538      42.59     505.00     567.00
2               493.75            489      28.40     474.00     515.00
3               558.88            555      41.22     535.00     582.50
4               524.27            526      39.08     500.00     548.00
5               517.19            512      32.67     497.00     534.25
7               552.61            553      50.49     529.00     590.00
8               526.52            526      34.77     502.00     549.00
9               551.20            551      41.00     526.75     582.50
10              567.90            572      46.50     542.75     601.00
11              550.94            549      51.75     518.00     578.00
# Boxplots of math scores by schoolid1
ggplot(data, aes(x = factor(schoolid1), y = math1)) +
  geom_boxplot(fill = "#AFAAFF", color = "#263238") +
  labs(title = "Math Scores by School ID",
       x = "School ID",
       y = "Math Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Scatterplot of math scores by schoolid1
ggplot(data, aes(x = factor(schoolid1), y = math1)) +
  geom_jitter(color = "#004D33", alpha = 0.6, width = 0.1, size = 0.9) +
  labs(title = "Scatterplot of Math Scores by School ID",
       x = "School ID",
       y = "Math Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

There is no distinct pattern in student performance across schools; in fact, performance varies widely within each school. Some schools clearly have higher-performing students than others, which warrants further investigation into whether the differences across schools are significant.

4.8.3 Math Scores (math1) v.s. Ethnicity Overlap (ethnicity_overlap)

# Scatterplot of math scores against ethnicity overlap
ggplot(data, aes(x = ethnicity_overlap, y = math1)) +
  geom_jitter(color = "#AFAAFF", alpha = 0.6, size = 1, width = 0.1, height = 0.1) +  # Reduce overlap
  labs(title = "Scatterplot of Math Scores vs. Ethnicity Overlap",
       x = "Ethnicity Overlap",
       y = "Math Score") +
  theme_minimal()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Visually, there seems to be a wider range of math scores in classes with either 0% or 100% teacher-student ethnicity overlap. This is an interesting trend that will be explored further in the inferential analysis to determine whether any differences arise in classrooms where teachers and students share an ethnicity.

4.8.4 Math Scores (math1) v.s. Economic Status (lunch1)

lunch_count <- data %>%
  group_by(lunch1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE),
            median_math1 = median(math1, na.rm = TRUE),
            sd_math1 = sd(math1, na.rm = TRUE),
            q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
            q3_math1 = quantile(math1, 0.75, na.rm = TRUE))
            
lunch_count %>%
  kable(caption = "Table 9: Summary Statistics of Math Scores by Economic Status") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Table 9: Summary Statistics of Math Scores by Economic Status

lunch1     mean_math1   median_math1   sd_math1   q1_math1   q3_math1
non-free       544.97            545      41.55        518        572
free           516.73            515      39.99        488        542
NA             528.08            526      41.89        502        553
ggplot(data, aes(x = lunch1, y = math1)) +
  geom_boxplot(fill = "#AFAAFF", color = "#263238") +
  labs(title = "Math Scores by Economic Status",
       x = "Economic Status",
       y = "Math Score") +
  theme_minimal()

ggplot(data, aes(x = star1, y = math1, fill = lunch1)) +
  geom_boxplot() +
  labs(title = "Math Scores by Class Type and Economic Status",
       x = "STAR Class Type",
       y = "Math Score",
       fill = "Economic Status") +
  theme_minimal() +
  scale_fill_manual(values = c("#AFAAFF", "#4D3898"))

Overall, we notice that students who do not qualify for free lunch have higher mean, median, and quartile math scores than students who receive free lunch, and this gap persists within each class type. This is an interesting trend that will be further explored.

4.8.5 Math Scores (math1) v.s. Teacher’s Degree Level (degree1)

degree_count <- data %>%
  group_by(degree1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE),
            median_math1 = median(math1, na.rm = TRUE),
            sd_math1 = sd(math1, na.rm = TRUE),
            q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
            q3_math1 = quantile(math1, 0.75, na.rm = TRUE))

degree_count %>%
  kable(caption = "Table 10: Summary Statistics of Math Scores by Degree Level") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Table 10: Summary Statistics of Math Scores by Degree Level

degree1      mean_math1   median_math1   sd_math1   q1_math1   q3_math1
bachelor         528.98          526.0      42.53      500.0      557.0
master           533.34          532.0      44.19      502.0      562.0
specialist       535.65          529.0      43.13      518.0      562.0
phd              533.14          524.5      35.25      507.0      555.0
NA               548.50          549.0      26.79      536.5      564.5
ggplot(data, aes(x = degree1, y = math1)) +
  geom_boxplot(fill = "#AFAAFF", color = "#263238") +
  labs(title = "Math Scores by Teacher's Degree Level",
       x = "Teacher's Degree Level",
       y = "Math Score") +
  theme_minimal()

The distribution of math scores across teacher degree levels is relatively similar. Further analysis is needed to determine whether there are significant differences in student performance across teacher degree levels.

4.8.6 Math Scores (math1) v.s. Teacher’s Teaching Experience (experience1)

# Scatterplot of math scores against teaching experience
ggplot(data, aes(x = experience1, y = math1)) +
  geom_jitter(color = "#AFAAFF", alpha = 0.6, size = 1, width = 0.1, height = 0.1) +  # Reduce overlap
  labs(title = "Scatterplot of Math Scores vs. Teaching Experience",
       x = "Teaching Experience",
       y = "Math Score") +
  theme_minimal()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

There is no clear pattern or visual association between teacher’s teaching experience and student math scores.

4.8.7 Math Scores (math1) v.s. Teacher’s Career Ladder Level (ladder1)

ladder_count <- data %>%
  group_by(ladder1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE),
            median_math1 = median(math1, na.rm = TRUE),
            sd_math1 = sd(math1, na.rm = TRUE),
            q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
            q3_math1 = quantile(math1, 0.75, na.rm = TRUE))

ladder_count %>%
  kable(caption = "Table 11: Summary Statistics of Math Scores by Career Ladder Level") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Table 11: Summary Statistics of Math Scores by Career Ladder Level

ladder1      mean_math1   median_math1   sd_math1   q1_math1   q3_math1
level1           532.18            529      43.17      502.0        562
level2           529.80            523      48.60      495.5        557
level3           541.86            538      45.44      515.0        572
apprentice       528.46            526      43.03      500.0        553
probation        521.39            520      39.11      493.0        549
notladder        526.52            529      41.82      495.0        557
NA               501.09            488      43.86      468.0        538
ggplot(data, aes(x = ladder1, y = math1)) +
  geom_boxplot(fill = "#AFAAFF", color = "#263238") +
  labs(title = "Math Scores by Teacher's Career Ladder Level",
       x = "Teacher's Career Ladder Level",
       y = "Math Score") +
  theme_minimal()

While level 3 teachers have a slightly higher mean math score compared to other career ladder levels, the distribution of math scores across career ladder levels is also relatively similar. Further analysis will be required to determine if there are any significant differences in student performance across teacher career ladder levels.

5 Inferential analysis

5.1 Model Choice (for initial analysis)

The model choice for the initial analysis is a two-way ANOVA model with the following structure: \[Y_{ijk} = \mu_{..} + \alpha_{i} + \beta_{j} + \epsilon_{ijk}\] where the index \(i\) represents the class type: small (\(i=1\)), regular (\(i=2\)), regular with aide (\(i=3\)), and the index \(j\) represents the school indicator. The rest of the parameters are as follows:

  • \(Y_{ijk}\): Math scaled score of student \(k\) in class type \(i\) at school \(j\)
  • \(\mu_{..}\): Overall mean math score
  • \(\alpha_{i}\): Main effect of class type \(i\) with constraint \(\sum_{i=1}^{3} \alpha_{i} = 0\)
  • \(\beta_{j}\): Main effect of school \(j\) with constraint \(\sum_{j=1}^{80} \beta_{j} = 0\)
  • \(\epsilon_{ijk}\): Random error in the math score of student \(k\) in class type \(i\) at school \(j\)

While there are only 3 class types, there are 80 unique school IDs in the dataset. Given this large number of schools, including the interaction term \((\alpha \beta)_{ij}\) would significantly increase the complexity of the model. A further justification for excluding the interaction term is that the primary question of interest is whether math scaled scores differ across class types, not whether the effect of class type varies across schools. The exclusion of the interaction term is also supported in the following sections through the interaction plot.
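Fitting the additive model with aov() can be sketched on simulated balanced data (the simulated layout is our own assumption; the real fit would be `aov(math1 ~ star1 + schoolid1, data = data)` with schoolid1 treated as a factor):

```r
set.seed(3)
# Balanced toy layout: 3 class types x 10 schools x 20 students each
sim <- expand.grid(
  star1     = factor(c("small", "regular", "regular+aide")),
  schoolid1 = factor(1:10),
  rep       = 1:20
)
sim$math1 <- 530 + rnorm(nrow(sim), 0, 40)

# Additive two-way ANOVA: main effects only, no interaction term
fit <- aov(math1 ~ star1 + schoolid1, data = sim)
# Under treatment contrasts: 1 intercept + 2 class-type + 9 school coefficients
length(coef(fit))
```

With 80 schools the real model would estimate 79 school coefficients, which is why the interaction (adding roughly 2 x 79 more parameters) is left out.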

5.2 Model Assumptions

The assumptions of the two-way ANOVA model are as follows:

  • Independence: The residuals \(\epsilon_{ijk}\) are independent of each other
  • Normality: The residuals \(\epsilon_{ijk}\) are normally distributed
  • Homoscedasticity: The variance of the residuals \(\epsilon_{ijk}\) is constant across all levels of the independent variables

5.3 Model Justification

5.3.1 Main Effects of Class Type and School ID

# Main effect plot of class type
class_type_effect <- data %>%
  group_by(star1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE),
            SE = sd(math1, na.rm = TRUE) / sqrt(n()),
            LowerCI = mean_math1 - qt(0.975, n() - 1) * SE,
            UpperCI = mean_math1 + qt(0.975, n() - 1) * SE)

p5 <- ggplot(class_type_effect, aes(x = star1, y = mean_math1, group = 1)) +
  geom_point(size = 1, color = "#263238") +
  geom_line(color = "#607D8B") +
  geom_errorbar(aes(ymin = LowerCI, ymax = UpperCI), width = 0.2, color = "#263238") +
  labs(title = "Main Effect of Class Type on Math Scores",
       x = "Class Type",
       y = "Mean Math Score") +
  theme_minimal()

# Main effect plot of school ID
school_id_effect <- data %>%
  group_by(schoolid1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE),
            SE = sd(math1, na.rm = TRUE) / sqrt(n()),
            LowerCI = mean_math1 - qt(0.975, n() - 1) * SE,
            UpperCI = mean_math1 + qt(0.975, n() - 1) * SE)

p6 <- ggplot(school_id_effect, aes(x = factor(schoolid1), y = mean_math1, group = 1)) +
  geom_point(size = 1, color = "#263238") +
  geom_line(color = "#607D8B") +
  geom_errorbar(aes(ymin = LowerCI, ymax = UpperCI), width = 0.2, color = "#263238") +
  labs(title = "Main Effect of School ID on Math Scores",
       x = "School ID",
       y = "Mean Math Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

p5 + p6 + plot_layout(widths = c(1,3))+
  plot_annotation(title = "Main Effects Plots") & 
  theme(plot.title = element_text(hjust = 0.5))

The two main effects plots show the mean math scores for each class type and school ID. The error bars represent the 95% confidence intervals for the mean math scores. The main effect of class type on math scores is evident, with the small class type having the largest mean math score. The main effect of school ID on math scores is less clear, with a wide range of mean math scores across school IDs.

5.3.2 Interaction Plot

# Interaction plot of class type and school ID

data_interaction <- data %>%
  group_by(star1, schoolid1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE))

ggplot(data_interaction, aes(x = schoolid1, y = mean_math1, color = star1, group = star1)) +
  geom_point(size = 2, alpha = 0.5) +           
  geom_line(linewidth = 1, alpha = 0.5) +            
  labs(title = "Interaction Plot of Class Type and School ID",
       x = "School ID",
       y = "Mean Math Score",
       color = "Class Type") + 
  scale_color_manual(values = c('#C8384A','#008059','#EC7D33')) +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

We observe that the three lines representing the class types are approximately parallel, suggesting no substantial interaction between class type and school ID. This further supports the choice of a two-way ANOVA model without the interaction term.

5.4 Model Fitting and Results

5.4.1 Fitting the Two-Way ANOVA Model

# Fit the two-way ANOVA model

anova_data <- data %>%
  dplyr::select(math1, star1, schoolid1) %>%
  group_by(star1, schoolid1) %>%
  summarise(mean_math1 = mean(math1, na.rm = TRUE)) %>%
  ungroup()

aov1 <- aov(mean_math1 ~ star1 + schoolid1, data = anova_data)
model_summary <- tidy(aov1)
model_coefficients <- aov1$coefficients

kable(model_summary, caption = "Table 12: Summary of Two-Way ANOVA Model") %>%
  kable_styling(full_width = FALSE)
Table 12: Summary of Two-Way ANOVA Model
term df sumsq meansq statistic p.value
star1 2 6498.885 3249.4427 16.970995 2e-07
schoolid1 75 92549.606 1233.9947 6.444834 0e+00
Residuals 146 27954.674 191.4704 NA NA
kable(model_coefficients, col.names = c("Term", "Coefficient"), caption = "Table 13: Coefficients of Two-Way ANOVA Model") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  scroll_box(height = '350px')
Table 13: Coefficients of Two-Way ANOVA Model
Term Coefficient
(Intercept) 536.3152738
star1small 12.2436323
star1regular+aide 1.0328096
schoolid12 -46.4955656
schoolid13 19.3623114
schoolid14 -20.0255109
schoolid15 -23.7106002
schoolid17 11.1501721
schoolid18 -11.4208778
schoolid19 10.4710144
schoolid110 29.3934447
schoolid111 9.0281964
schoolid112 -5.6740878
schoolid113 13.4610974
schoolid114 -14.1688246
schoolid115 -40.6870900
schoolid116 -33.2553531
schoolid117 -39.5429710
schoolid119 -49.1398018
schoolid120 -9.8498814
schoolid121 -37.5891134
schoolid122 -38.8812306
schoolid123 2.5432131
schoolid124 -0.8375447
schoolid125 -34.4582320
schoolid126 -52.8325445
schoolid127 -18.5300707
schoolid128 -42.1042852
schoolid129 5.4404928
schoolid130 -50.3786629
schoolid131 -22.1971234
schoolid132 -43.7283148
schoolid133 -28.9380938
schoolid134 -17.2533148
schoolid135 -25.0634623
schoolid136 -1.9733748
schoolid137 -3.1890555
schoolid138 -36.8154155
schoolid139 0.8598628
schoolid140 0.5427169
schoolid141 21.3174893
schoolid143 3.7910217
schoolid144 -20.1778553
schoolid145 -4.8020290
schoolid146 -18.3259341
schoolid147 -15.4939596
schoolid148 -21.7937078
schoolid149 -12.3468798
schoolid150 -18.6521488
schoolid151 -7.5506357
schoolid152 13.4060710
schoolid153 -13.8063915
schoolid154 0.7333495
schoolid155 -15.0926063
schoolid156 -32.8001663
schoolid157 -14.3560716
schoolid158 5.7666494
schoolid159 7.6228819
schoolid160 -29.3690613
schoolid161 14.6062570
schoolid162 -10.4041816
schoolid163 7.4878145
schoolid164 2.3400536
schoolid165 2.7528618
schoolid166 -11.1278299
schoolid167 -7.8160610
schoolid168 3.6509272
schoolid169 24.5625582
schoolid170 -0.2149951
schoolid171 -16.4442204
schoolid172 28.2014796
schoolid173 25.9606619
schoolid174 32.1481168
schoolid175 -9.1795033
schoolid177 15.0848184
schoolid178 -12.3518656
schoolid179 -4.1099932
schoolid180 0.6786317

The fitted model can be interpreted as follows:

  • Intercept = 536.32: The mean math score for the reference group, which here is the regular class type at school ID 1. This is the expected math score when both predictors are at their reference levels.
  • Class Type (star1) coefficients:
    • Small = 12.24: Students in small classes scored 12.24 points higher than those in regular classes (the reference category).
    • Regular with Aide = 1.03: Students in regular classes with a teacher aide scored only 1.03 points higher than those in regular classes.
  • School ID (schoolid1) coefficients (sample):
    • School ID 2 = -46.50: The expected math score for students at school ID 2 is 46.50 points lower than the reference school ID 1.
    • School ID 3 = 19.36: The expected math score for students at school ID 3 is 19.36 points higher than the reference school ID 1.

5.5 F-Test

We can answer the primary question of interest by conducting an F-test to determine if there are significant differences in math scaled scores across class types. The null and alternative hypotheses are as follows:

  • \(H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0\) (No significant differences in math scaled scores across class types)
  • \(H_A: \text{At least one } \alpha_i \neq 0\) (Significant differences in math scaled scores across class types)

Assumptions for the F-test include the normality of residuals and homoscedasticity, which remain the same as for the two-way ANOVA model.

# F-test for class type
kable(Anova(lm(mean_math1 ~ star1 + schoolid1, data = anova_data), type = 2), caption = "Table 14: F-Test for Class Type") %>%
  kable_styling(full_width = FALSE)
Table 14: F-Test for Class Type
Sum Sq Df F value Pr(>F)
star1 6939.552 2 18.121737 1e-07
schoolid1 92549.606 75 6.444834 0e+00
Residuals 27954.674 146 NA NA

The F-test results indicate that the p-value for class type (star1) is less than 0.05, suggesting that there are significant differences in math scaled scores across class types. We reject the null hypothesis and conclude that at least one class type has a significantly different mean math score compared to the others.

5.6 Tukey’s Range Test

To further investigate the differences in math scaled scores across class types, we can conduct Tukey’s range test to identify which class types are significantly different from each other. Tukey’s range test was chosen due to its ability to compare all pairs of class types and control the family-wise error rate.

# Tukey's range test for class type
aov_tukey <- aov(mean_math1 ~ star1, data = anova_data)
# Perform Tukey's HSD test
tukey_results <- TukeyHSD(aov_tukey)

# Display results in a table
kable(tidy(tukey_results), caption = "Table 15: Tukey's Range Test for Class Type") %>%
  kable_styling(full_width = FALSE)
Table 15: Tukey’s Range Test for Class Type
term contrast null.value estimate conf.low conf.high adj.p.value
star1 small-regular 0 12.243632 3.305346 21.181919 0.0040311
star1 regular+aide-regular 0 2.069144 -6.992436 11.130723 0.8522804
star1 regular+aide-small 0 -10.174489 -19.236068 -1.112909 0.0234132
tukey_df <- as.data.frame(tukey_results$star1)
tukey_df$Comparison <- rownames(tukey_df)

# Rename columns for clarity
colnames(tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")

ggplot(tukey_df, aes(x = reorder(Comparison, Difference), y = Difference)) +
  geom_point(color = "#263238", size = 3) +  # Mean difference points
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2, color = "#607D8B") +  # Confidence intervals
  geom_hline(yintercept = 0, linetype = "dashed", color = "#C8384A") +  # Line at y=0 for reference
  labs(title = "Tukey HSD Test: Math Score Differences by Class Type",
       x = "Class Type Comparison",
       y = "Mean Difference in Math Scores (95% CI)") +
  coord_flip() +  # Flip for readability
  theme_minimal()

The Tukey’s range test results can be interpreted as follows:

  • Small vs. Regular: The confidence interval does not cross zero, indicating that the mean math score for students in small classes is significantly different from those in regular classes (students in small classes scored higher).
  • Regular with Aide vs. Regular: The confidence interval crosses zero, suggesting no statistically significant difference between regular classes and regular classes with an aide. Having an aide in regular classes does not significantly impact math scores.
  • Regular with Aide vs. Small: The confidence interval does not cross zero, indicating that the mean math score for students in regular classes with an aide is significantly different from those in small classes (students in small classes scored higher).

Overall, we find that students in small classes have significantly higher math scores compared to students in regular classes and regular classes with an aide. However, there is no significant difference between regular classes and regular classes with an aide.

6 Sensitivity analysis

This section will run several tests to verify the assumptions of the two-way ANOVA model.

6.1 Independence Assumption

We noted that the residuals \(\epsilon_{ijk}\) should be independent of each other. We can examine the residuals to check for any patterns or autocorrelation that may violate this assumption.

# Residual plot of the fitted model

residuals <- residuals(aov1)

ggplot(anova_data, aes(x = fitted(aov1), y = residuals)) +
  geom_point(color = "#263238", size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#C8384A") +
  labs(title = "Residual Plot of the Fitted Model",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

The residual plot shows no clear pattern or trend, indicating that the residuals are randomly distributed around zero. This suggests that the independence assumption is plausible.

6.2 Normality Assumption

We also assumed that the residuals \(\epsilon_{ijk}\) are normally distributed. We can examine the normality of the residuals using a Q-Q plot.

# Q-Q plot of standardized residuals
ggplot(data.frame(standardized_residuals = rstandard(aov1)), aes(sample = standardized_residuals)) +
  stat_qq(size = 1, color = "#263238") +
  stat_qq_line(linetype = "dashed", color = "#C8384A") +
  labs(title = "Q-Q Plot of Standardized Residuals",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()

The Q-Q plot shows that the standardized residuals closely follow the theoretical quantiles, indicating that the normality assumption is plausible. However, the tails deviate slightly from the reference line, which could indicate mild skewness or heavy tails. We investigate this further with a Shapiro-Wilk test.

6.2.1 Shapiro-Wilk Test

The Shapiro-Wilk test is a statistical test of the null hypothesis that the data was drawn from a normal distribution. We can use this test to formally assess the normality of the residuals. The hypotheses for the Shapiro-Wilk test are as follows:

  • \(H_0\): The residuals are normally distributed
  • \(H_A\): The residuals are not normally distributed
# Shapiro-Wilk test for normality
shapiro_test <- shapiro.test(residuals)
shapiro_test
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.98598, p-value = 0.02642

The p-value of the Shapiro-Wilk test is \(0.026 < 0.05\), indicating that we reject the null hypothesis of normality. This suggests that the residuals are not normally distributed, which may be due to the slight deviations observed in the Q-Q plot. We should note that the Shapiro-Wilk test is sensitive to large sample sizes, and even minor deviations from normality can lead to a significant result.
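The sample-size sensitivity of the Shapiro-Wilk test can be illustrated with simulated data (hypothetical, not from Project STAR): the same mild departure from normality, here a t distribution with 20 degrees of freedom, tends to be flagged much more readily as the sample grows.

```r
# Sketch (simulated data): Shapiro-Wilk p-values for the same mild
# departure from normality (t distribution, 20 df) at two sample sizes.
# Rejections become more likely as n grows, even for small departures.
set.seed(4)
p_small <- shapiro.test(rt(50, df = 20))$p.value
p_large <- shapiro.test(rt(4000, df = 20))$p.value
c(n_50 = p_small, n_4000 = p_large)
```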

6.2.2 Box-Cox Transformation

To address the non-normality of the residuals, we can apply a Box-Cox transformation to the response variable (math scores) to stabilize the variance and make the residuals more normally distributed.

# Box-Cox transformation for lambda selection
boxcox_model <- boxcox(aov1, lambda = seq(-2, 2, by = 0.1))

# Identify the optimal lambda (where log-likelihood is maximized)
optimal_lambda <- boxcox_model$x[which.max(boxcox_model$y)]
cat("Optimal lambda:", optimal_lambda)
## Optimal lambda: -1.676768

The optimal lambda value from the Box-Cox transformation is -1.6768, which we can apply to the math scores to transform the data and re-fit the model.

data$math1_transformed <- (data$math1^optimal_lambda - 1) / optimal_lambda

anova_transformed_data <- data %>%
  group_by(star1, schoolid1) %>%
  summarise(mean_math1 = mean(math1_transformed, na.rm = TRUE))

aov_transformed <- aov(mean_math1 ~ star1 + schoolid1, data = anova_transformed_data)

model_summary <- tidy(aov_transformed)
model_coefficients <- aov_transformed$coefficients

kable(model_summary, caption = "Table 16: Summary of Two-Way ANOVA Model (Transformed, lambda=-1.6768)") %>%
  kable_styling(full_width = FALSE)
Table 16: Summary of Two-Way ANOVA Model (Transformed, lambda=-1.6768)
term df sumsq meansq statistic p.value
star1 2 0 0 16.532573 3e-07
schoolid1 75 0 0 6.632094 0e+00
Residuals 146 0 0 NA NA

After the transformation, the sums of squares display as zero because the transformed response lies on a very small scale, which makes the fitted model hard to read. We therefore also try the simpler transformation \(\lambda = -1\) (the reciprocal transformation).
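The near-zero sums of squares are a display issue rather than a modeling problem: the ANOVA F statistic is invariant to linear rescaling of the response. A toy sketch (hypothetical data, not from Project STAR):

```r
# Sketch (toy data): rescaling a tiny-scale response makes the sums of
# squares readable without changing the F statistic or the inference.
set.seed(2)
toy <- data.frame(g = factor(rep(1:3, each = 20)))
toy$y_tiny <- 1e-4 * (as.integer(toy$g) + rnorm(60))      # very small scale
f_tiny   <- summary(aov(y_tiny ~ g, data = toy))[[1]]$`F value`[1]
f_scaled <- summary(aov(I(1e6 * y_tiny) ~ g, data = toy))[[1]]$`F value`[1]
all.equal(f_tiny, f_scaled)  # TRUE: same F statistic
```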

optimal_lambda <- -1

data$math1_transformed <- (data$math1^optimal_lambda - 1) / optimal_lambda

anova_transformed_data <- data %>%
  group_by(star1, schoolid1) %>%
  summarise(mean_math1 = mean(math1_transformed, na.rm = TRUE))

aov_transformed <- aov(mean_math1 ~ star1 + schoolid1, data = anova_transformed_data)

model_summary <- tidy(aov_transformed)
model_coefficients <- aov_transformed$coefficients

kable(model_summary, caption = "Table 17: Summary of Two-Way ANOVA Model (Transformed, lambda=-1)") %>%
  kable_styling(full_width = FALSE)
Table 17: Summary of Two-Way ANOVA Model (Transformed, lambda=-1)
term df sumsq meansq statistic p.value
star1 2 1.0e-07 0 16.705837 3e-07
schoolid1 75 1.2e-06 0 6.608896 0e+00
Residuals 146 3.0e-07 0 NA NA

The p-value for class type star1 is less than 0.05, indicating that there are significant differences in math scaled scores across class types. Since the transformed model results align with the original model, we can conclude that the Box-Cox transformation did not significantly impact the results.

6.3 Homoscedasticity Assumption

To verify the homoscedasticity assumption, we can conduct a formal test using the Breusch-Pagan test. The Breusch-Pagan test performs an auxiliary regression of the squared residuals on the independent variables to test for homoscedasticity. The hypotheses for the Breusch-Pagan test are as follows:

  • \(H_0\): The residuals are homoscedastic
  • \(H_A\): The residuals are heteroscedastic
# Breusch-Pagan test for homoscedasticity
breusch_pagan_test <- bptest(aov1)
breusch_pagan_test
## 
##  studentized Breusch-Pagan test
## 
## data:  aov1
## BP = 126.17, df = 77, p-value = 0.0003498

The p-value of the Breusch-Pagan test is \(0.00035 < 0.05\), indicating that we reject the null hypothesis of homoscedasticity. This suggests that the residuals do not have constant variance across all levels of the independent variables. This could mostly be due to the variability in math scores across different schools. Since our primary question of interest is focused on class types, the violation of the homoscedasticity assumption may not significantly impact our conclusions, but it is important to note this limitation.
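One way to gauge the impact of this violation on the class-type comparison is a Welch-type ANOVA, which does not assume equal variances. The sketch below uses simulated data with deliberately unequal group variances; on the real data this would be `oneway.test(mean_math1 ~ star1, data = anova_data)`, noting that this one-way method does not accommodate the school factor.

```r
# Sketch (toy data): Welch's ANOVA in base R relaxes the equal-variance
# assumption for a single factor (here a stand-in for class type).
set.seed(3)
toy <- data.frame(g = factor(rep(c("small", "regular", "regular+aide"), each = 30)))
toy$y <- 530 + 5 * (toy$g == "small") +
  rnorm(90, sd = ifelse(toy$g == "small", 5, 15))    # unequal variances
oneway.test(y ~ g, data = toy, var.equal = FALSE)    # Welch's ANOVA
```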

6.4 Alternative Method: Median Instead of Mean

To explore an alternative method, we can calculate the median math scores for each class type and compare the results to the mean math scores.

anova_data_median <- data %>%
  group_by(star1, schoolid1) %>%
  summarise(median_math1 = median(math1, na.rm = TRUE)) %>%
  ungroup()


# Histogram for Mean Math Scores
p1 <- ggplot(anova_data, aes(x = mean_math1)) +
  geom_histogram(fill = "#00D29A", color = "#263238", alpha = 0.6, bins = 30) +
  labs(title = "Histogram of Mean Math Scores", x = "Math Score", y = "Frequency") +
  theme_minimal()

# Histogram for Median Math Scores
p2 <- ggplot(anova_data_median, aes(x = median_math1)) +
  geom_histogram(fill = "#AFAAFF", color = "#263238", alpha = 0.6, bins = 30) +
  labs(title = "Histogram of Median Math Scores", x = "Math Score", y = "Frequency") +
  theme_minimal()

# Display the two histograms side by side
p1 | p2

aov_median <- aov(median_math1 ~ star1 + schoolid1, data = anova_data_median)

model_summary_median <- tidy(aov_median)
model_coefficients_median <- aov_median$coefficients

kable(model_summary_median, caption = "Table 18: Summary of Two-Way ANOVA Model using Median Math Scores") %>%
  kable_styling(full_width = FALSE)
Table 18: Summary of Two-Way ANOVA Model using Median Math Scores
term df sumsq meansq statistic p.value
star1 2 7513.019 3756.5096 16.201372 4e-07
schoolid1 75 103815.313 1384.2042 5.969905 0e+00
Residuals 146 33852.096 231.8637 NA NA
kable(model_coefficients_median, col.names = c("Term", "Coefficient"), caption = "Table 19: Coefficients of Two-Way ANOVA Model using Median Math Scores") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  scroll_box(height = '350px')
Table 19: Coefficients of Two-Way ANOVA Model using Median Math Scores
Term Coefficient
(Intercept) 540.2674829
star1small 11.9210526
star1regular+aide -1.7235015
schoolid12 -52.3333333
schoolid13 12.8333333
schoolid14 -19.5000000
schoolid15 -32.3333333
schoolid17 8.0000000
schoolid18 -13.6666667
schoolid19 9.3333333
schoolid110 35.5000000
schoolid111 3.0000000
schoolid112 -0.6666667
schoolid113 7.1666667
schoolid114 -18.6666667
schoolid115 -47.7280093
schoolid116 -33.3333333
schoolid117 -43.3333333
schoolid119 -52.8333333
schoolid120 -12.5000000
schoolid121 -44.1666667
schoolid122 -46.3333333
schoolid123 -3.7280093
schoolid124 -4.3333333
schoolid125 -36.8333333
schoolid126 -60.9780093
schoolid127 -23.8333333
schoolid128 -47.0000000
schoolid129 -3.0000000
schoolid130 -52.1666667
schoolid131 -21.2280093
schoolid132 -53.6666667
schoolid133 -33.5000000
schoolid134 -17.8333333
schoolid135 -28.8333333
schoolid136 -8.6666667
schoolid137 -7.0000000
schoolid138 -47.0000000
schoolid139 4.3333333
schoolid140 -1.1666667
schoolid141 17.0000000
schoolid143 7.0000000
schoolid144 -22.1666667
schoolid145 -6.1666667
schoolid146 -22.0000000
schoolid147 -21.1666667
schoolid148 -25.3333333
schoolid149 -15.8333333
schoolid150 -22.0000000
schoolid151 -10.3333333
schoolid152 11.1666667
schoolid153 -22.1666667
schoolid154 -4.6666667
schoolid155 -26.5000000
schoolid156 -42.6666667
schoolid157 -21.0000000
schoolid158 4.5000000
schoolid159 5.1666667
schoolid160 -33.8333333
schoolid161 4.0000000
schoolid162 -16.6666667
schoolid163 8.5000000
schoolid164 -3.1666667
schoolid165 -3.0000000
schoolid166 -12.1666667
schoolid167 -5.3333333
schoolid168 -1.3333333
schoolid169 26.5000000
schoolid170 -5.3333333
schoolid171 -19.5000000
schoolid172 26.6666667
schoolid173 19.6666667
schoolid174 30.3333333
schoolid175 -15.6666667
schoolid177 9.3333333
schoolid178 -14.6666667
schoolid179 -5.3333333
schoolid180 -7.6666667

The histograms of the mean and median math scores show very similar distributions, indicating that the choice of summary measure (mean vs. median) may result in similar conclusions. The two-way ANOVA model using median math scores verifies this result, with the p-value for class type being less than 0.05, indicating significant differences in math scaled scores across class types. This result aligns with the original model using mean math scores. Therefore, the choice of summary measure does not significantly impact the conclusions of the analysis in the context of this particular two-way ANOVA model.

The conclusions for the secondary question of interest (which class type is associated with the highest math scaled scores) also remain consistent with the original model using mean math scores.

# Tukey's range test for class type
aov_tukey <- aov(median_math1 ~ star1, data = anova_data_median)
# Perform Tukey's HSD test
tukey_results <- TukeyHSD(aov_tukey)

# Display results in a table
kable(tidy(tukey_results), caption = "Table 20: Tukey's Range Test for Class Type (Median-based Model)") %>%
  kable_styling(full_width = FALSE)
Table 20: Tukey’s Range Test for Class Type (Median-based Model)
term contrast null.value estimate conf.low conf.high adj.p.value
star1 small-regular 0 11.9210526 2.367418 21.474687 0.0099853
star1 regular+aide-regular 0 -0.6154971 -10.300912 9.069918 0.9876810
star1 regular+aide-small 0 -12.5365497 -22.221965 -2.851135 0.0071240
tukey_df <- as.data.frame(tukey_results$star1)
tukey_df$Comparison <- rownames(tukey_df)

# Rename columns for clarity
colnames(tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")

ggplot(tukey_df, aes(x = reorder(Comparison, Difference), y = Difference)) +
  geom_point(color = "#263238", size = 3) +  # Mean difference points
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2, color = "#607D8B") +  # Confidence intervals
  geom_hline(yintercept = 0, linetype = "dashed", color = "#C8384A") +  # Line at y=0 for reference
  labs(title = "Tukey HSD Test: Math Score Differences by Class Type (Median-based Model)",
       x = "Class Type Comparison",
       y = "Mean Difference in Math Scores (95% CI)") +
  coord_flip() +  # Flip for readability
  theme_minimal()

Model assumptions for the median-based model are the same as the mean-based model, and the results are consistent with the original model.

# Residuals plot for median-based model
residuals_median <- residuals(aov_median)

p7 <- ggplot(anova_data_median, aes(x = fitted(aov_median), y = residuals_median)) +
  geom_point(color = "#263238", size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#C8384A") +
  labs(title = "Residual Plot of the Median-based Model",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()



# Q-Q plot of standardized residuals for median-based model
p8 <- ggplot(data.frame(standardized_residuals = rstandard(aov_median)), aes(sample = standardized_residuals)) +
  stat_qq(size = 1, color = "#263238") +
  stat_qq_line(linetype = "dashed", color = "#C8384A") +
  labs(title = "Q-Q Plot of Residuals (Median-based Model)",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()

p7 | p8

# Shapiro-Wilk test for normality of residuals (median-based model)
shapiro_test_median <- shapiro.test(residuals_median)
shapiro_test_median
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals_median
## W = 0.99338, p-value = 0.4202

However, for the median-based model we fail to reject the null hypothesis of normality, indicating that its residuals are consistent with a normal distribution. If future questions of interest involve more volatile (higher-variability) data, this result suggests that a median-based model could be more appropriate.
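The appeal of the median for volatile data comes from its robustness to outliers, which a one-line example (hypothetical scores) makes concrete:

```r
# Sketch: a single outlying score shifts the mean substantially
# but leaves the median unchanged.
x <- c(520, 525, 530, 535, 700)  # one outlying score
c(mean = mean(x), median = median(x))  # mean = 562, median = 530
```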

7 Discussion

Acknowledgement

I acknowledge and thank my classmates Hangyu Li and Shang Chen for their helpful discussions and sharing unique approaches to the analysis.

Reference

Achilles, C. M. (2012, October). Class-size policy: The STAR experiment and related class-size studies (NCPEA Policy Brief, Vol. 1, No. 2). NCPEA Publications. ERIC Number: ED540485.

Session info

R session information is reported for reproducibility:

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] AER_1.2-14       survival_3.6-4   sandwich_3.1-1   lmtest_0.9-40   
##  [5] zoo_1.8-12       broom_1.0.7      MASS_7.3-60.2    car_3.1-3       
##  [9] carData_3.0-5    patchwork_1.3.0  kableExtra_1.4.0 knitr_1.48      
## [13] ggthemes_5.1.0   ggplot2_3.5.1    dplyr_1.1.4     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    tidyr_1.3.1      
##  [5] xml2_1.3.6        lattice_0.22-6    stringi_1.8.4     digest_0.6.37    
##  [9] magrittr_2.0.3    evaluate_1.0.0    grid_4.4.1        fastmap_1.2.0    
## [13] Matrix_1.7-0      jsonlite_1.8.9    backports_1.5.0   Formula_1.2-5    
## [17] purrr_1.0.2       fansi_1.0.6       viridisLite_0.4.2 scales_1.3.0     
## [21] jquerylib_0.1.4   abind_1.4-8       cli_3.6.3         crayon_1.5.3     
## [25] rlang_1.1.4       splines_4.4.1     munsell_0.5.1     withr_3.0.1      
## [29] cachem_1.1.0      yaml_2.3.10       tools_4.4.1       colorspace_2.1-1 
## [33] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   stringr_1.5.1    
## [37] pkgconfig_2.0.3   pillar_1.9.0      bslib_0.8.0       gtable_0.3.5     
## [41] glue_1.8.0        systemfonts_1.2.1 highr_0.11        xfun_0.47        
## [45] tibble_3.2.1      tidyselect_1.2.1  rstudioapi_0.17.1 farver_2.1.2     
## [49] htmltools_0.5.8.1 labeling_0.4.3    rmarkdown_2.28    svglite_2.1.3    
## [53] compiler_4.4.1